Pokémon is a popular game series in which trainers catch and train fictional creatures, also called Pokémon, to battle other trainers. I chose to focus this project on Pokémon because the franchise is recognised by almost everyone, yet not everyone is aware of the depth of data behind these beloved pocket monsters. With a large competitive player base and regular additions to the series, powercreep is a plausible risk. Powercreep is the process by which new content introduced to a game series is consistently more powerful than older content. Players then favour the newer content, making older content redundant.

Data Origins

(Pokéball image source: https://pngimg.com/image/27658)

The dataset was retrieved from Kaggle and was published by Mario Tormo Romero. This project uses only the pokedex_(Update_05.20).csv file. The data were collated from pokemondb.net and www.serebii.net, and contain a wealth of information on all known Pokémon species (and their variations) up to the franchise’s eighth generation.
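A minimal sketch of how the file might be loaded, assuming it sits in the working directory. The path and the `load_pokedex` helper are illustrative and not taken from the original analysis. Declaring the encoding as UTF-8 is one way to try to avoid the garbled Japanese names seen in the preview that follows; whether it is sufficient depends on the platform and locale.

```r
# Hypothetical helper: read the Kaggle CSV into a data frame.
load_pokedex <- function(path) {
  read.csv(path,
           encoding = "UTF-8",        # mark strings as UTF-8
           stringsAsFactors = FALSE)  # keep text columns as character
}

csv_path <- "pokedex_(Update_05.20).csv"  # assumed location of the file
if (file.exists(csv_path)) {
  rawdf <- load_pokedex(csv_path)
}
```

The unnamed index column in the CSV is given the name `X` by `read.csv()`, which matches the first column in the preview.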

## # A tibble: 1,028 x 51
##        X pokedex_number name     german_name japanese_name     generation status
##    <int>          <int> <chr>    <chr>       <chr>                  <int> <chr> 
##  1     0              1 Bulbasa~ Bisasam     "フシギダãƒ\~          1 Normal
##  2     1              2 Ivysaur  Bisaknosp   "フシギソウ~          1 Normal
##  3     2              3 Venusaur Bisaflor    "フシギãƒ\u00~          1 Normal
##  4     3              3 Mega Ve~ Bisaflor    "フシギãƒ\u00~          1 Normal
##  5     4              4 Charman~ Glumanda    "ヒトカゲ (H~          1 Normal
##  6     5              5 Charmel~ Glutexo     "リザード (L~          1 Normal
##  7     6              6 Chariza~ Glurak      "リザードン~          1 Normal
##  8     7              6 Mega Ch~ Glurak      "リザードン~          1 Normal
##  9     8              6 Mega Ch~ Glurak      "リザードン~          1 Normal
## 10     9              7 Squirtle Schiggy     "ゼニガメ (Z~          1 Normal
## # ... with 1,018 more rows, and 44 more variables: species <chr>,
## #   type_number <int>, type_1 <chr>, type_2 <chr>, height_m <dbl>,
## #   weight_kg <dbl>, abilities_number <int>, ability_1 <chr>, ability_2 <chr>,
## #   ability_hidden <chr>, total_points <dbl>, hp <dbl>, attack <dbl>,
## #   defense <dbl>, sp_attack <dbl>, sp_defense <dbl>, speed <dbl>,
## #   catch_rate <dbl>, base_friendship <dbl>, base_experience <dbl>,
## #   growth_rate <chr>, egg_type_number <int>, egg_type_1 <chr>,
## #   egg_type_2 <chr>, percentage_male <dbl>, egg_cycles <dbl>,
## #   against_normal <dbl>, against_fire <dbl>, against_water <dbl>,
## #   against_electric <dbl>, against_grass <dbl>, against_ice <dbl>,
## #   against_fight <dbl>, against_poison <dbl>, against_ground <dbl>,
## #   against_flying <dbl>, against_psychic <dbl>, against_bug <dbl>,
## #   against_rock <dbl>, against_ghost <dbl>, against_dragon <dbl>,
## #   against_dark <dbl>, against_steel <dbl>, against_fairy <dbl>

As you can see, the dataset is quite large, so the analysis focuses on only some of the variables. The table below describes the variables used in the analysis. The full codebook can be found in the GitHub repository.

| Variable name | Format | Description |
|---|---|---|
| pokedex_number | numerical | The entry number of the Pokémon in the National Pokédex |
| name | string | The English name of the Pokémon |
| generation (cleaned dataset) | string | The generation in which the Pokémon was first introduced, presented as Roman numerals |
| species | string | The category of the Pokémon |
| combined_type (cleaned dataset) | string | The types of the Pokémon in alphabetical order |
| total_points | numerical | Total number of base points |
| hp | numerical | The base HP of the Pokémon |
| attack | numerical | The base Attack of the Pokémon |
| defense | numerical | The base Defense of the Pokémon |
| sp_attack | numerical | The base Special Attack of the Pokémon |
| sp_defense | numerical | The base Special Defense of the Pokémon |
| speed | numerical | The base Speed of the Pokémon |

Research Questions

The present visualisations aim to address the following questions:

  1. Has the introduction of new Pokémon in subsequent game generations resulted in powercreep?
  2. How has the distribution of base stats changed in new Pokémon across the generations?

Data Preparation

Luckily there wasn’t too much data wrangling needed on this dataset.

library(dplyr) #required for %>%, rename, rowwise and mutate
library(tidyr) #required for separate

cleaneddf <- rawdf

##change generation variable from numbers to roman numerals
cleaneddf$generation <- as.character(as.roman(cleaneddf$generation))

##rename defense and sp_defense to the English spellings
cleaneddf <- cleaneddf %>%
  rename(defence = defense) %>%
  rename(sp_defence = sp_defense)

##removing "Pokémon" from species values
#function to reverse strings by word
reverse_words <- function(string)
{
  # split string by blank spaces
  string_split = strsplit(as.character(string), split = " ")
  # how many split terms?
  string_length = length(string_split[[1]])
  # decide what to do
  if (string_length == 1) {
    # one word (do nothing)
    reversed_string = string_split[[1]]
  } else {
    # more than one word (collapse them)
    reversed_split = string_split[[1]][string_length:1]
    reversed_string = paste(reversed_split, collapse = " ")
  }
  # output
  return(reversed_string)
} 
#reverse word order in species column
cleaneddf$species <- sapply(cleaneddf$species, reverse_words, USE.NAMES = FALSE)

#removing "Pokémon" from species values
cleaneddf <- cleaneddf %>%
  separate(species, c(NA, "species"), sep = " ", extra = "merge", fill = "left")

#reverting word order in species column
cleaneddf$species <- sapply(cleaneddf$species, reverse_words, USE.NAMES = FALSE)

##merging type_1 and type_2 columns alphabetically
cleaneddf <- cleaneddf %>%
  rowwise() %>%
  mutate(combined_type = paste(sort(c(type_1, type_2)), collapse = " ")) %>%
  ungroup()
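
Because the Roman numerals are stored as plain strings, anything that orders the generations does so alphabetically. That happens to give the right order for I to VIII, but it would place IX before V if a ninth generation were added. A minimal sketch of one alternative, using a factor with explicitly ordered levels, follows. The `generation_factor` helper is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: convert generation numbers to Roman numerals
# stored as a factor whose levels follow numeric order.
generation_factor <- function(gen_numbers) {
  gen_numbers <- as.integer(gen_numbers)
  # build the levels from the sorted numbers, not from the sorted strings
  roman_levels <- as.character(utils::as.roman(sort(unique(gen_numbers))))
  factor(as.character(utils::as.roman(gen_numbers)), levels = roman_levels)
}

levels(generation_factor(c(9, 5, 1)))
# "I" "V" "IX"
```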

NB: the code for the reverse_words function was found at www.gastonsanchez.com.
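
If the only goal is to drop the trailing word "Pokémon", a single regular expression is a possible alternative to the reverse, separate, and reverse approach. Like the original steps, the sketch below removes whatever the last word is whenever a value has more than one word, and leaves one-word values alone. `drop_last_word` is an illustrative name, not a function from the original code.

```r
# Hypothetical helper: remove the final whitespace-separated word
# from each string; one-word strings are returned unchanged.
drop_last_word <- function(x) {
  sub("[[:space:]]+[^[:space:]]+$", "", as.character(x))
}

# possible usage on the cleaned data (assumes the species column exists):
# cleaneddf$species <- drop_last_word(cleaneddf$species)
```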

Following data cleaning, I subset the data so that it included only the columns used in the analysis. The table below shows the first 5 rows of the subset data.

## # A tibble: 5 x 12
##   pokedex_number name        generation species combined_type total_points    hp
##            <int> <chr>       <chr>      <chr>   <chr>                <dbl> <dbl>
## 1              1 Bulbasaur   I          Seed    "Grass Poiso~          318    45
## 2              2 Ivysaur     I          Seed    "Grass Poiso~          405    60
## 3              3 Venusaur    I          Seed    "Grass Poiso~          525    80
## 4              3 Mega Venus~ I          Seed    "Grass Poiso~          625    80
## 5              4 Charmander  I          Lizard  " Fire"                309    39
## # ... with 5 more variables: attack <dbl>, defence <dbl>, sp_attack <dbl>,
## #   sp_defence <dbl>, speed <dbl>
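
The leading space in " Fire" suggests that single-type Pokémon carry an empty string rather than NA in type_2, so the empty string is sorted first and pasted in. A minimal sketch of one way to avoid this, dropping empty or missing types before pasting, follows. `combine_types` is a hypothetical helper, and the empty-string assumption is inferred from the output above rather than confirmed against the raw file.

```r
# Hypothetical helper: paste two type columns together alphabetically,
# ignoring empty strings and NA so single types get no stray space.
combine_types <- function(type_1, type_2) {
  mapply(function(a, b) {
    types <- c(a, b)
    types <- types[!is.na(types) & nzchar(types)]
    paste(sort(types), collapse = " ")
  }, as.character(type_1), as.character(type_2), USE.NAMES = FALSE)
}

combine_types(c("Grass", "Fire", "Water"), c("Poison", "", "Ground"))
# "Grass Poison" "Fire" "Ground Water"
```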

Summary statistics are presented below:

library(vtable) #required for sumtable function

#variables to include in the tables
variables <- c('total_points', 'hp', 'attack', 'defence', 'sp_attack', 'sp_defence', 'speed')

#table of summary statistics across numerical variables
sumoverall <- sumtable(cleaneddf,
                       #specifying which variables to include in table
                       vars = variables,
                       #specifying which functions to include in the table
                       summ = list(c('median(x)', 'mean(x)', 'sd(x)', 'min(x)', 'max(x)')),
                       #specifying column names
                       summ.names = list(c('Median', 'Mean', 'SD', 'Minimum Value', 'Maximum Value')),
                       #how many decimal points to include
                       digits = 2,
                       #don't include trailing 0s
                       fixed.digits = FALSE,
                       #title of table
                       title = 'Summary statistics')
#print table
sumoverall

Summary statistics

| Variable | Median | Mean | SD | Minimum Value | Maximum Value |
|---|---|---|---|---|---|
| total_points | 455 | 437.57 | 121.66 | 175 | 1125 |
| hp | 66.5 | 69.58 | 26.39 | 1 | 255 |
| attack | 76 | 80.12 | 32.37 | 5 | 190 |
| defence | 70 | 74.48 | 31.3 | 5 | 250 |
| sp_attack | 65 | 72.73 | 32.68 | 10 | 194 |
| sp_defence | 70 | 72.13 | 28.08 | 20 | 250 |
| speed | 65 | 68.53 | 29.8 | 5 | 180 |

#table of mean and sd by generation
sumbygroup <- sumtable(cleaneddf,
                       #specify variables
                       vars = variables,
                       #specify functions to include in table
                       summ = list(c('mean(x)', 'sd(x)')),
                       #specify column names
                       summ.names = list(c('Mean', 'SD')),
                       #group by the generation variable
                       group = "generation",
                       #how many decimal points
                       digits = 2,
                       #don't include trailing 0s
                       fixed.digits = FALSE,
                       #title of table
                       title = 'Mean and standard deviation by generation')
#print table
sumbygroup

Mean and standard deviation by generation

| Variable | I Mean | I SD | II Mean | II SD | III Mean | III SD | IV Mean | IV SD | V Mean | V SD | VI Mean | VI SD | VII Mean | VII SD | VIII Mean | VIII SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| total_points | 424.1 | 112.48 | 419.14 | 119.6 | 436.01 | 135.05 | 459.02 | 119.56 | 435.16 | 107.61 | 442.55 | 118.76 | 459.23 | 123.31 | 438.33 | 141.52 |
| hp | 64.76 | 27.35 | 71.48 | 30.4 | 66.81 | 23.96 | 73.08 | 25.11 | 72.29 | 22.61 | 69.88 | 26.1 | 71.46 | 27.22 | 70.47 | 30.01 |
| attack | 77.3 | 29.61 | 71.87 | 32.6 | 81.03 | 36.3 | 82.87 | 32.78 | 82.98 | 30.99 | 77.19 | 29.83 | 87.31 | 33.87 | 80 | 30.99 |
| defence | 71.01 | 28.59 | 73.82 | 39.19 | 74.05 | 34.73 | 78.13 | 30.15 | 72.27 | 23.03 | 77.02 | 31.2 | 79.2 | 32.39 | 75.12 | 33.59 |
| sp_attack | 69.93 | 33.68 | 66.12 | 27.8 | 75.6 | 35.13 | 76.4 | 31.91 | 70.93 | 32.1 | 75.26 | 32.45 | 77.85 | 35.63 | 71.77 | 28.87 |
| sp_defence | 68.6 | 24.9 | 74.34 | 31.51 | 71.14 | 30.73 | 77.19 | 27.5 | 68.43 | 22.17 | 75.2 | 29.65 | 75.56 | 28.73 | 72.45 | 32.51 |
| speed | 72.51 | 29.86 | 61.51 | 27.31 | 67.38 | 31.03 | 71.34 | 28.48 | 68.26 | 29.16 | 68 | 26.78 | 67.85 | 31.26 | 68.51 | 33.49 |

NB: Median values were not included in the summary statistics by generation because they are available in the visualisation.
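
For readers who want the medians as numbers rather than reading them off the boxplots, a minimal base R sketch follows. It assumes a data frame with `generation` and `total_points` columns like `cleaneddf`. The small `toy` data frame and the `median_by_generation` helper are made up for illustration and do not contain the real values.

```r
# Hypothetical helper: median of one stat within each generation.
median_by_generation <- function(df, stat = "total_points") {
  out <- aggregate(df[[stat]],
                   by = list(generation = df$generation),
                   FUN = median)
  names(out)[2] <- paste0("median_", stat)  # rename the default "x" column
  out
}

# toy data, NOT the real Pokemon values
toy <- data.frame(
  generation   = c("I", "I", "I", "II", "II"),
  total_points = c(300, 400, 500, 320, 480)
)
median_by_generation(toy)
```

With the real data the call would be `median_by_generation(cleaneddf)`, or `median_by_generation(cleaneddf, "speed")` for an individual stat.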

Visualisations

First, let’s visualise whether there is any evidence of powercreep:

library(plotly) #required to build the boxplot

###visualisation 1: total stats across generations

#graph colour palette - each colour was inspired by a game that was released in that generation
gen_colors <- c("#fad61d", #pokemon yellow
                "#b4c5f6", #pokemon silver
                "#5abd8b", #pokemon emerald
                "#bd6ad5", #pokemon pearl
                "#202029", #pokemon black
                "#015f9f", #pokemon x
                "#f59423", #pokemon sun
                "#e5005a") #pokemon shield

#extra information to add to datapoints on hover
text <- ~paste(' Name: ', name,
               '</br> Pokedex No: ', pokedex_number,
               '</br> Type: ', combined_type,
               '</br> Species: ', species)
#graph type
graph_type <- "box"

#show all datapoints alongside each box
boxpoints <- "all"

#width of jitter (between 0 and 1)
jitter <- 1

#position of points relative to each box (-2 places them furthest to the left)
pointpos <- -2

#graph dimensions in pixels
width <- 900
height <- 750

#set up base plot
figtot <- plot_ly(cleaneddf,
                  #set y variable
                  y = ~total_points,
                  #specifying that a different colour should be used for each pokemon generation
                  color = ~generation,
                  #specifying the colour palette
                  colors = gen_colors,
                  #specify graph type
                  type = graph_type,
                  #show datapoints
                  boxpoints = boxpoints,
                  #include jitter
                  jitter = jitter,
                  #jitter to display to the left of each boxplot
                  pointpos = pointpos,
                  #specify graph dimensions
                  width = width,
                  height = height,
                  #adds extra information on hover, x & y values included by default
                  text = text) %>%
  #add layout information
  layout(
  #add title
  title = "Total base statistics across Pokémon generations",
  #add x axis label
   xaxis = list(title = list(text = "Generation")),
  #add  y axis label
  yaxis = list(title = list( text = "Total points")),
  #do not show legend
  showlegend = FALSE)

#plot boxplot
figtot

We can see that the median total points were noticeably lower in the first three generations than in the later generations. However, the mean total points has remained relatively stable across generations.

There is also a general trend of the interquartile range (IQR) increasing across the generations (generation III is the notable exception, with a large IQR compared to the others). This suggests that there’s more variety in the total base points assigned to newer Pokémon.
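
The spread described above can also be checked numerically. A minimal sketch using `tapply()` with `IQR()` follows, again on made-up toy values. `iqr_by_generation` is an illustrative helper, and applying it to `cleaneddf` assumes the column names from the cleaned dataset.

```r
# Hypothetical helper: interquartile range of one stat per generation.
iqr_by_generation <- function(df, stat = "total_points") {
  tapply(df[[stat]], df$generation, IQR)
}

# toy data, NOT the real Pokemon values
toy <- data.frame(
  generation   = rep(c("I", "II"), each = 5),
  total_points = c(300, 350, 400, 450, 500,
                   200, 300, 400, 500, 600)
)
iqr_by_generation(toy)
```

Note that R's default quantile method may differ slightly from the one the plotting library uses to draw the boxes, so small discrepancies with the figure are possible.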

What about if we look at individual stats?

###visualisation 2: boxplot showing a breakdown of stats across generations

#colours for traces were inspired by bulbapedia's base stats display
hpcol<- "red"
attcol <- "orange"
defcol <- "darkmagenta" #replaced yellow to make visualisation clearer
specattcol <- "deepskyblue"
specdefcol <- "green"
speedcol <- "hotpink"

fig <- plot_ly(cleaneddf,
               #specify graph type
               type = graph_type,
               #set graph dimensions
               width = width,
               height = height,
               #adds hovertext
               text = text)
#add plots
fig <- fig %>% add_trace(type = graph_type, x = ~hp, y = ~generation, name = "Health points", color=I(hpcol))
fig <- fig %>% add_trace(type = graph_type, x = ~attack, y = ~generation, name = "Attack", color=I(attcol))
fig <- fig %>% add_trace(type = graph_type, x = ~defence, y = ~generation, name = "Defence", color=I(defcol))
fig <- fig %>% add_trace(type = graph_type, x = ~sp_attack, y = ~generation, name = "Special attack", color=I(specattcol))
fig <- fig %>% add_trace(type = graph_type, x = ~sp_defence, y = ~generation, name = "Special defence", color=I(specdefcol))
fig <- fig %>% add_trace(type = graph_type, x = ~speed, y = ~generation, name = "Speed", color=I(speedcol))

fig <- fig %>% layout(
  #grouping traces by generation
  boxmode = "group",
  #add title
  title = "Base statistics across Pokémon generations",
  #reverse axis so Gen 1 shows at the top
   yaxis = list(autorange = "reversed",
                #add y axis label
                title = "Generation"),
  #add  x axis label
  xaxis = list(title = list( text = "Points"))
  )

#plot boxplot
fig

From the visualisation, we can see that:

Summary

Overall, there has been no evidence of powercreep within the Pokémon games since the fourth generation. Rather than adding increasingly powerful Pokémon, the focus may be on adding more specialised Pokémon. This would also explain why there are few trends in the distribution of individual stats across generations. Further evidence for specialisation comes from the fact that, other than Eternatus Eternamax (which is an outlier on total base points), no Pokémon is an outlier on more than two of the base statistics.
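
One way to check a claim like the outlier count is the 1.5 × IQR rule commonly used for boxplots. The sketch below is an assumption about how such a count could be made, not the method used in this analysis: it flags outliers against the whole dataset rather than within each generation, so its results may differ from what the boxplots show. `is_outlier`, `count_outlier_stats`, and the toy data are all hypothetical.

```r
# Hypothetical helper: TRUE where a value lies beyond 1.5 * IQR
# from the first or third quartile.
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical helper: number of stats on which each row is an outlier.
count_outlier_stats <- function(df, stats) {
  flags <- sapply(df[stats], is_outlier)  # logical matrix, one row per Pokemon
  rowSums(flags)
}

# toy data, NOT the real Pokemon values
toy <- data.frame(
  name   = c("A", "B", "C", "D", "E", "F"),
  hp     = c(50, 55, 60, 65, 70, 255),
  attack = c(50, 55, 60, 65, 70, 75)
)
toy$n_outlier_stats <- count_outlier_stats(toy, c("hp", "attack"))
```

On the cleaned data, the `stats` argument would be the six base stat column names, and rows with a count above two would be the ones of interest.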

The main limitation of this dataset is that it represents the most up-to-date stats for each Pokémon. Some Pokémon’s stats have changed across the generations. For example, in generation one, special attack and special defence were both represented by a single stat called special, which was split in subsequent games. Therefore, this dataset may mask evidence of powercreep. It would be interesting to replicate these visualisations using the stats from when each Pokémon was introduced. Another option with the present dataset would be to compare stats across Pokémon types to see whether one type is superior.

It would also be interesting to compare the most popular Pokémon with their base statistics. Do people like a specific Pokémon because they are the best, or do other factors such as nostalgia or cuteness come into play? Data for this could come from the Pokémon of the year survey conducted by Pokémon or Google search data.

Dependencies

This file was created using:

Packrat was used for package management.

If you would like to run this code, please download unbundlepackrat.r and assignment-2021-05-20.tar from the GitHub repository. Instructions on how to unbundle assignment-2021-05-20.tar can be found within unbundlepackrat.r.

The full repository for this analysis can be found here.